Colwellia Metapangenome

Following http://merenlab.org/2019/03/14/ncbi-genome-download-magic/

Download Colwellia genomes from NCBI.

Process metadata.txt to download NCBI genomes

Before running the next line, make sure you add in the fasta.txt file those genomes that were not automatically downloaded in the previous line (e.g. SAGs, MAGs, or unpublished genomes).

Here we include three additional MAGs Colwellia MAG1 Sieradzki et al 2021, Colwellia MAG2 CampeĂŁo, et al 2019, Colwellia MAG3 Borchert et al 2021

Here it is needed to run the next lines to reduce probability of errors from Anvio

Run this, if anvio is newly installed.

anvi-setup-scg-taxonomy

and

anvi-setup-ncbi-cogs

Then, the following line was run.

anvi-run-workflow -w contigs -c default_config.txt

Now creating external_genomes.txt file

Now creating the database of Colwellia genomes.

Now we can run the pangenomic analysis

anvi-pan-genome -g COLWELLIA_GENOMES.db --project-name "Colwellia" --output-dir COLWELLIA --num-threads 6 --minbit 0.5 --mcl-inflation 10 --use-ncbi-blast --enforce-hierarchical-clustering

To display the pangenome.

anvi-display-pan -p COLWELLIA/Colwellia-PAN.db -g COLWELLIA_GENOMES.db

Selection of a genomic reference

The anvio pangenome analysis doesn't offer to generate a consensus reference from a set of genomes. Therefore, if we have 27 metatranscriptomic libraries and 73 genomes in the pangenome, this translates into 27 x 73 = 1971 mapping procedures (and ultimately 1971 BAM files), which is not feasible to plot or to process in a single figure. Therefore we need to chose only one genome as mapping reference for the metapangenomic analysis. We are going to choose the genome that recruits the largest amount of reads from our transcriptomic libraries.

Anvio requires that contig names are the same in the contig database as well as BAM files. To prevent future errors, we are going to export fasta files from the contig databases and they are going to be used for downstream processes.

mkdir 04_MAPPING

Then we submit the following script:

#!/bin/bash
#PBS -N test
#PBS -q joye_q
#PBS -m abe
#PBS -M montenegrotito@yahoo.com
#PBS -l nodes=1:ppn=1:AMD
#PBS -l walltime=10:00:00
#PBS -l mem=20gb
cd $PBS_O_WORKDIR
module load Bowtie2/2.3.4.1-foss-2016b
bowtie2-build NCBI_genomes_v2/Colwellia_MAG1-contigs.fa 04_MAPPING/Colwellia_MAG1_index
bowtie2-build NCBI_genomes_v2/Colwellia_MAG2-contigs.fa 04_MAPPING/Colwellia_MAG2_index
bowtie2-build NCBI_genomes_v2/Colwellia_MAG3-contigs.fa 04_MAPPING/Colwellia_MAG3_index
bowtie2-build NCBI_genomes_v2/Colwellia_SAG-contigs.fa 04_MAPPING/Colwellia_SAG_index
bowtie2-build NCBI_genomes_v2/Colwellia_ponticola_GCF_005885605_1-contigs.fa 04_MAPPING/Colwellia_ponticola_GCF_005885605_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__6M3_GCF_014077595_1-contigs.fa 04_MAPPING/Colwellia_sp__6M3_GCF_014077595_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__BRX10_1_GCF_014077545_1-contigs.fa 04_MAPPING/Colwellia_sp__BRX10_1_GCF_014077545_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__BRX10_2_GCF_014077515_1-contigs.fa 04_MAPPING/Colwellia_sp__BRX10_2_GCF_014077515_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__BRX10_3_GCF_014077505_1-contigs.fa 04_MAPPING/Colwellia_BRX10_3_GCF_014077505_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__BRX10_4_GCF_014077465_1-contigs.fa 04_MAPPING/Colwellia_sp__BRX10_4_GCF_014077465_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__BRX10_5_GCF_014077435_1-contigs.fa 04_MAPPING/Colwellia_sp__BRX10_5_GCF_014077435_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__BRX10_6_GCF_014077455_1-contigs.fa 04_MAPPING/Colwellia_sp__BRX10_6_GCF_014077455_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__BRX10_7_GCF_014077415_1-contigs.fa 04_MAPPING/Colwellia_sp__BRX10_7_GCF_014077415_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__BRX10_9_GCF_014077405_1-contigs.fa 04_MAPPING/Colwellia_sp__BRX10_9_GCF_014077405_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__BRX8_2_GCF_014077375_1-contigs.fa 04_MAPPING/Colwellia_sp__BRX8_2_GCF_014077375_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__BRX8_3_GCF_014077355_1-contigs.fa 04_MAPPING/Colwellia_sp__BRX8_3_GCF_014077355_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__BRX8_4_GCF_014077345_1-contigs.fa 04_MAPPING/Colwellia_BRX8_4_GCF_014077345_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__BRX8_5_GCF_014077325_1-contigs.fa 04_MAPPING/Colwellia_sp__BRX8_5_GCF_014077325_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__BRX8_6_GCF_014077305_1-contigs.fa 04_MAPPING/Colwellia_sp__BRX8_6_GCF_014077305_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__BRX8_7_GCF_014077275_1-contigs.fa 04_MAPPING/Colwellia_sp__BRX8_7_GCF_014077275_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__BRX8_9_GCF_014077245_1-contigs.fa 04_MAPPING/Colwellia_sp__BRX8_9_GCF_014077245_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__BRX9_1_GCF_014077195_1-contigs.fa 04_MAPPING/Colwellia_sp__BRX9_1_GCF_014077195_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__Bg11_12_GCF_014076955_1-contigs.fa 04_MAPPING/Colwellia_sp__Bg11_12_GCF_014076955_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__C2M11_GCF_018860915_1-contigs.fa 04_MAPPING/Colwellia_sp__C2M11_GCF_018860915_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__D2M02_GCF_018860755_1-contigs.fa 04_MAPPING/Colwellia_sp__D2M02_GCF_018860755_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__E2M01_GCF_018860585_1-contigs.fa 04_MAPPING/Colwellia_sp__E2M01_GCF_018860585_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB02u_10_GCF_014077185_1-contigs.fa 04_MAPPING/Colwellia_sp__MB02u_10_GCF_014077185_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB02u_11_GCF_014076925_1-contigs.fa 04_MAPPING/Colwellia_MB02u_11_GCF_014076925_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB02u_12_GCF_014077155_1-contigs.fa 04_MAPPING/Colwellia_sp__MB02u_12_GCF_014077155_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB02u_14_GCF_014076915_1-contigs.fa 04_MAPPING/Colwellia_sp__MB02u_14_GCF_014076915_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB02u_18_GCF_014077125_1-contigs.fa 04_MAPPING/Colwellia_sp__MB02u_18_GCF_014077125_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB02u_19_GCF_014077115_1-contigs.fa 04_MAPPING/Colwellia_sp__MB02u_19_GCF_014077115_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB02u_1_GCF_014077215_1-contigs.fa 04_MAPPING/Colwellia_sp__MB02u_1_GCF_014077215_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB02u_6_GCF_014077095_1-contigs.fa 04_MAPPING/Colwellia_sp__MB02u_6_GCF_014077095_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB02u_7_GCF_014076905_1-contigs.fa 04_MAPPING/Colwellia_sp__MB02u_7_GCF_014076905_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB02u_9_GCF_014077085_1-contigs.fa 04_MAPPING/Colwellia_sp__MB02u_9_GCF_014077085_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB3u_22_GCF_014076885_1-contigs.fa 04_MAPPING/Colwellia_sp__MB3u_22_GCF_014076885_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB3u_28_GCF_014076795_1-contigs.fa 04_MAPPING/Colwellia_sp__MB3u_28_GCF_014076795_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB3u_41_GCF_014076845_1-contigs.fa 04_MAPPING/Colwellia_sp__MB3u_41_GCF_014076845_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB3u_43_GCF_014076995_1-contigs.fa 04_MAPPING/Colwellia_sp__MB3u_43_GCF_014076995_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB3u_45_GCF_014077005_1-contigs.fa 04_MAPPING/Colwellia_sp__MB3u_45_GCF_014077005_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB3u_4_GCF_014077065_1-contigs.fa 04_MAPPING/Colwellia_sp__MB3u_4_GCF_014077065_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB3u_55_GCF_014076785_1-contigs.fa 04_MAPPING/Colwellia_sp__MB3u_55_GCF_014076785_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB3u_64_GCF_014076835_1-contigs.fa 04_MAPPING/Colwellia_sp__MB3u_64_GCF_014076835_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB3u_70_GCF_014077045_1-contigs.fa 04_MAPPING/Colwellia_sp__MB3u_70_GCF_014077045_1_index
bowtie2-build NCBI_genomes_v2/Colwellia_sp__MB3u_8_GCF_014076985_1-contigs.fa 04_MAPPING/Colwellia_sp__MB3u_8_GCF_014076985_1_index

Next, for each of the transcriptomic libraries we submit a mapping script. This script is in 99_SCRIPTSas t-sub.sh_map_MT_2_refgenomes STDOUT files are located in ``98_SCREENING```

In the next lines we are going to process the STDOUT files into a table.

Based on the average of mapped reads per library:

Colwellia_MAG2                                 495256.629630
Colwellia_psychrerythraea_34H_GCF_000012325    220714.518519
Colwellia_demingiae_GCF_007954275              219478.074074
Colwellia_sp__Bg11_28_GCF_002836245            218717.333333
Colwellia_psychrerythraea_GCF_000764225        160240.111111
Colwellia_MAG3                                 151193.703704
Colwellia_sp__12G3_GCF_002836775               139843.666667
Colwellia_sp__20A7_GCF_009832865               139278.777778
Colwellia_psychrerythraea_GCF_000764185        136558.703704
Colwellia_sp__75C3_GCF_002836255               134185.666667
dtype: float64

Based on the average of the greatest (maximum) number of mapped reads in a library :

Colwellia_MAG2                                 2219912
Colwellia_psychrerythraea_34H_GCF_000012325     840718
Colwellia_sp__Bg11_28_GCF_002836245             834618
Colwellia_demingiae_GCF_007954275               802817
Colwellia_psychrerythraea_GCF_000764225         591947
Colwellia_MAG3                                  520897
Colwellia_psychrerythraea_GCF_000764185         485183
Colwellia_sp__20A7_GCF_009832865                483193
Colwellia_sp__C2M11_GCF_018860915_1             466352
Colwellia_sp__12G3_GCF_002836775                465563
dtype: int64

Based on the average of mapping rates per library:

Colwellia_MAG2                                 4.146107
Colwellia_psychrerythraea_34H_GCF_000012325    1.895749
Colwellia_demingiae_GCF_007954275              1.892606
Colwellia_sp__Bg11_28_GCF_002836245            1.878863
Colwellia_psychrerythraea_GCF_000764225        1.403293
Colwellia_MAG3                                 1.339042
Colwellia_sp__20A7_GCF_009832865               1.250246
Colwellia_sp__12G3_GCF_002836775               1.249385
Colwellia_psychrerythraea_GCF_000764185        1.224811
Colwellia_sp__75C3_GCF_002836255               1.198097
dtype: float64

The largest yield of reads recruitment is obtained using the metagenomic assembled genome of Candidatus Colwellia aromaticivorans

Now I export the dataframes to prepare an excel sheet.

anvi BAM profile and downstream analysis

Now that we identified the best possible genomic reference, we can proceed to map. First we need the right contigs to do the mapping.

Check and runruncell19.sh

#!ls *.tgz | awk '{print "tar xzvf " $0}'



tar xzvf OIL11-RAW.bam.tgz
tar xzvf OIL14-RAW.bam.tgz
tar xzvf OIL17-RAW.bam.tgz
tar xzvf OIL25-RAW.bam.tgz
tar xzvf OIL28-RAW.bam.tgz
tar xzvf OIL29-RAW.bam.tgz
tar xzvf OIL31-RAW.bam.tgz
tar xzvf OIL32-RAW.bam.tgz
tar xzvf OIL34-RAW.bam.tgz
tar xzvf OIL37-RAW.bam.tgz
tar xzvf OIL44-RAW.bam.tgz
tar xzvf OIL47-RAW.bam.tgz
tar xzvf OIL5-RAW.bam.tgz
tar xzvf OIL50-RAW.bam.tgz
tar xzvf OIL53-RAW.bam.tgz
tar xzvf OIL61-RAW.bam.tgz
tar xzvf OIL64-RAW.bam.tgz
tar xzvf OIL67-RAW.bam.tgz
tar xzvf OIL70-RAW.bam.tgz
tar xzvf OIL78-RAW.bam.tgz
tar xzvf OIL8-RAW.bam.tgz
tar xzvf OIL81-RAW.bam.tgz
tar xzvf OIL82-RAW.bam.tgz
tar xzvf OIL84-RAW.bam.tgz
tar xzvf OIL85-RAW.bam.tgz
tar xzvf OIL87-RAW.bam.tgz
tar xzvf OIL90-RAW.bam.tgz



#!ls *.bam | awk '{print "anvi-init-bam "$0" -o " $0 ".bam" }' | sed -e 's/-RAW.bam.bam/.bam/'


anvi-init-bam OIL11-RAW.bam -o OIL11.bam
anvi-init-bam OIL14-RAW.bam -o OIL14.bam
anvi-init-bam OIL17-RAW.bam -o OIL17.bam
anvi-init-bam OIL25-RAW.bam -o OIL25.bam
anvi-init-bam OIL28-RAW.bam -o OIL28.bam
anvi-init-bam OIL29-RAW.bam -o OIL29.bam
anvi-init-bam OIL31-RAW.bam -o OIL31.bam
anvi-init-bam OIL32-RAW.bam -o OIL32.bam
anvi-init-bam OIL34-RAW.bam -o OIL34.bam
anvi-init-bam OIL37-RAW.bam -o OIL37.bam
anvi-init-bam OIL44-RAW.bam -o OIL44.bam
anvi-init-bam OIL47-RAW.bam -o OIL47.bam
anvi-init-bam OIL5-RAW.bam -o OIL5.bam
anvi-init-bam OIL50-RAW.bam -o OIL50.bam
anvi-init-bam OIL53-RAW.bam -o OIL53.bam
anvi-init-bam OIL61-RAW.bam -o OIL61.bam
anvi-init-bam OIL64-RAW.bam -o OIL64.bam
anvi-init-bam OIL67-RAW.bam -o OIL67.bam
anvi-init-bam OIL70-RAW.bam -o OIL70.bam
anvi-init-bam OIL78-RAW.bam -o OIL78.bam
anvi-init-bam OIL8-RAW.bam -o OIL8.bam
anvi-init-bam OIL81-RAW.bam -o OIL81.bam
anvi-init-bam OIL82-RAW.bam -o OIL82.bam
anvi-init-bam OIL84-RAW.bam -o OIL84.bam
anvi-init-bam OIL85-RAW.bam -o OIL85.bam
anvi-init-bam OIL87-RAW.bam -o OIL87.bam
anvi-init-bam OIL90-RAW.bam -o OIL90.bam



#!ls *.bam | awk '{print "anvi-init-bam "$0" -o " $0 ".bam" }' | sed -e 's/-RAW.bam.bam/.bam/'

anvi-profile -i OIL11.bam --sample-name od_0 -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL14.bam --sample-name d_0  -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL17.bam --sample-name odn_0    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL25.bam --sample-name bc_1    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL28.bam --sample-name o_1_1    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL29.bam --sample-name o_1_2    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL31.bam --sample-name od_1_1    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL32.bam --sample-name od_1_2 -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL34.bam --sample-name d_1    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL37.bam --sample-name odn_1    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL44.bam --sample-name bc_2    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL47.bam --sample-name o_2    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL50.bam --sample-name od_2    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL53.bam --sample-name d_2 -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL5.bam --sample-name bc_0    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL61.bam --sample-name bc_3    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL64.bam --sample-name o_3    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL67.bam --sample-name od_3    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL70.bam --sample-name d_3    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL78.bam --sample-name bc_4    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL81.bam --sample-name o_4_1 -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL82.bam --sample-name o_4_2    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL84.bam --sample-name od_4_1    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL85.bam --sample-name od_4_2    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL87.bam --sample-name d_4    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL8.bam --sample-name o_0    -c 02_CONTIGS/C_MAG2-contigs.db
anvi-profile -i OIL90.bam --sample-name odn_4    -c 02_CONTIGS/C_MAG2-contigs.db

Don't delete bam files yet. Sorted bam files will be needed in cell 28.

Here we need to go to display pangenome and we generate a single bin called "EVERYTHING" and then we summarize it

anvi-display-pan -p COLWELLIA/Colwellia-PAN.db -g COLWELLIA_GENOMES.db
`

The following R script was used to incorporate gene coverage and gene detection profiles from the metatranscriptomes into the pangenome.

#!/usr/bin/env Rscript
rm(list=ls());
graphics.off();
setwd("/Users/tito_miniconda/JOYE_LAB_ANVIO_PROJECTS/SK_BACKUP/p28_pangenomes/Colwellia")

#Libraries
library("dplyr")

#Input data

#Gene clusters from the pangenome
gene_clusters_df <- read.table(file = 'SUMMARY/Colwellia_gene_clusters_summary.txt', header = TRUE, sep = "\t", quote = "")

#Gene coverage profiles from the metatranscriptomic profiles
gene_coverages_df <- read.table(file='gene_cov_n_detection.txt-GENE-COVERAGES.txt', header=TRUE, sep="\t", quote="")

#Gene detection profiles from the metatranscriptomic profiles
gene_detection_df <- read.table(file = 'gene_cov_n_detection.txt-GENE-DETECTION.txt', header=TRUE, sep="\t", quote="")

#Let's remember that gene_clusters_df includes entries for all of the genomes stored in the pangenome. 
#At some point we need to subset those gene_callers_id entries that belong to the reference genome of the 
#transcriptomic merged profiles.

#'C_psychrerythraea_1' was the tag utilized for Colwellia psychrerythraea 34H

ref_genome =  'C_MAG2'

#Creating zero dataframes to store processed data
gene_cluster_names <- gene_clusters_df[,colnames(gene_clusters_df) %in% c('gene_cluster_id')]
gene_cluster_names <- unique(gene_cluster_names)
samples_names <- colnames(gene_coverages_df)
samples_names <- samples_names[2:length(samples_names)]
num_cols = length(samples_names)
num_rows = length(gene_cluster_names)
out_coverage_df = data.frame(matrix(0,ncol = num_cols, nrow = num_rows))
colnames(out_coverage_df ) <- samples_names
rownames(out_coverage_df ) <- gene_cluster_names
out_detection_df <- out_coverage_df 

#Loop to calculate the maximum coverage and maximum detection observed on each gene cluster.
for (gene_cluster in unique(gene_clusters_df$gene_cluster_id)){
  df <- gene_clusters_df[gene_clusters_df$gene_cluster_id == gene_cluster, ]
  df2 <- df[df$genome_name == ref_genome,]
  if(length(df2$gene_callers_id) > 0){
    for(my_gene_callers_id in df2$gene_callers_id){
      #For coverages
      df3 <- gene_coverages_df[gene_coverages_df$key == my_gene_callers_id, ]
      x1 <- as.vector(df3)
      x2 <- x1[2:length(x1)]
      #For detections
      df4 <- gene_detection_df[gene_detection_df$key == my_gene_callers_id, ]
      x3 <- as.vector(df4)
      x4 <- x3[2:length(x3)]

      for(i in 1:length(x2)){
        #For coverages
        the_max = max(x2[i],out_coverage_df[gene_cluster,i])
        out_coverage_df[gene_cluster,i] <- the_max
        #For detections
        the_max2 = max(x4[i],out_detection_df[gene_cluster,i])
        out_detection_df[gene_cluster,i] <- the_max2
      } 

    }

  }

}

#Adding prefixes and merging dataframes into one
colnames(out_coverage_df) <- paste('cov',colnames(out_coverage_df),sep='_')
colnames(out_detection_df) <- paste('det',colnames(out_detection_df),sep='_')

out_coverage_df <- tibble::rownames_to_column(out_coverage_df, "gene_cluster_id")
out_detection_df <- tibble::rownames_to_column(out_detection_df, "gene_cluster_id")

output_df <- merge(out_coverage_df, out_detection_df,by="gene_cluster_id")

#Export data frame into an anvio friendly table
write.table(output_df, "gene_clusters_additional_data.txt", quote=FALSE, sep="\t", na="", row.names=FALSE)

Now let's import our table into the Pangenomic anvio database

Filtering pangenome by mapping recovery

First we want to collapse or split those gene clusters that did not recover transcriptomic reads on any library. To do this:

  1. Create a bin called no_signals and select everything. Tip: Choose a yellow color for this.
  2. Create a bin called T_signals. Tip: Choose a black color for this.
  3. Having the can T_signals selected, use the expression Det X $> 0$, where X is any of the treatments, and then click on append to selected bin. You will have to loop manually through each treatments.
  4. Save the bin collection as bins_by_recovery.

This would look like this before splitting.

We are going to proceed to split by recovery:

From now on we are interested on the fraction of the pangenome that recovered transcriptomic signals.

anvi-display-pan -p SPLIT_PANs/T_signals/PAN.db -g COLWELLIA_GENOMES.db

Differentially Expressed Genes Layers

Counting mapped reads per gene using htseq

This section I am following https://metagenomics-workshop.readthedocs.io/en/latest/annotation/quantification.html. First we need to extract the a GFF file from the annotation of anvio.

Load htseq in docker.

docker pull lbmc/htseq:0.13.5
docker run --oom-kill-disable -v /Users/tito_miniconda/Data:/opt/Data -i -t lbmc/htseq:0.13.5 bash

The following lines were run in runcell29.sh:

!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL11.sorted.bam ref_genome.gtf > od_0_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL14.sorted.bam ref_genome.gtf > d_0_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL17.sorted.bam ref_genome.gtf > odn_0_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL25.sorted.bam ref_genome.gtf > bc_1_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL28.sorted.bam ref_genome.gtf > o_1_1_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL29.sorted.bam ref_genome.gtf > o_1_2_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL31.sorted.bam ref_genome.gtf > od_1_1_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL32.sorted.bam ref_genome.gtf > od_1_2_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL34.sorted.bam ref_genome.gtf > d_1_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL37.sorted.bam ref_genome.gtf > odn_1_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL44.sorted.bam ref_genome.gtf > bc_2_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL47.sorted.bam ref_genome.gtf > o_2_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL5.sorted.bam ref_genome.gtf > bc_0_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL50.sorted.bam ref_genome.gtf > od_2_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL53.sorted.bam ref_genome.gtf > d_2_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL61.sorted.bam ref_genome.gtf > bc_3_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL64.sorted.bam ref_genome.gtf > o_3_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL67.sorted.bam ref_genome.gtf > od_3_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL70.sorted.bam ref_genome.gtf > d_3_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL78.sorted.bam ref_genome.gtf > bc_4_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL8.sorted.bam ref_genome.gtf > o_0_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL81.sorted.bam ref_genome.gtf > o_4_1_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL82.sorted.bam ref_genome.gtf > o_4_2_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL84.sorted.bam ref_genome.gtf > od_4_1_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL85.sorted.bam ref_genome.gtf > od_4_2_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL87.sorted.bam ref_genome.gtf > d_4_mapcounts.txt
!htseq-count -r pos -t CDS -f bam --minaqual 2 /Volumes/Transcend/SK/p28_pangenome/Colwellia/sorted_bam/OIL90.sorted.bam ref_genome.gtf > odn_4_mapcounts.txt


!mkdir 96_htseq_output
!mv *mapcounts.txt 96_htseq_output

The following steps will calculate TPM values for contigs or genes based on count files

TPM values are defined as in Wagner et al (Theory in Biosciences) 2012.

$ TPM_{i} = \frac{rg * rl * 10^6}{f * T}$

rg: reads mapped to gene g

rl: read length

f: feature length

$T = \sum_{i} \frac{rg * rl}{f}$ for all $i$ genes

For this calculation we need to know the average read length

DESeq in R

Please close this notebook and open the notebook DE_Analysis.ipynb to run the differential expression analysis.

Load DE profiles and convert from gene to gene_cluster table

#35 #!sqlite3 -header -csv SPLIT_PANs/T_signals/PAN.db "select * from gene_clusters;" > pan_split_gene_clusters.csv !sqlite3 -header -csv SPLIT_PANs/T_signals/PAN.db "select * from gene_cluster_frequencies;" > pan_split_gene_clusters.csv !mv pan_split_gene_clusters.csv SPLIT_PANs/ pan_split_geneclusters_df = pd.read_csv(r'/Users/tito_miniconda/JOYE_LAB_ANVIO_PROJECTS/SK_BACKUP/p28_pangenomes/Colwellia/SPLIT_PANs/pan_split_gene_clusters.csv') #pan_split_geneclusters_df = pan_split_geneclusters_df[pan_split_geneclusters_df.genome_name == 'C_psychrerythraea_1'] DESEQ_df = pd.read_csv(r'/Users/tito_miniconda/JOYE_LAB_ANVIO_PROJECTS/SK_BACKUP/p28_pangenomes/Colwellia/Colwellia_DE_df.txt', sep='\t') #sig_df = DESEQ_df[((DESEQ_df.padj < 0.05) & (DESEQ_df.zeros_in_bc < 3)) | (DESEQ_df.t_test_p_value < 0.05)] sig_df = DESEQ_df[DESEQ_df.padj < 0.05] sig_df.rename(columns={'Gene':'gene_caller_id'}, inplace=True) #Building output scheme #pan_split_geneclusters_df2 = pan_split_geneclusters_df.drop(columns=['gene_caller_id']) DE_for_anvio_df = pd.concat([pan_split_geneclusters_df2,pan_split_geneclusters_df2,pan_split_geneclusters_df2], axis=1) DE_for_anvio_df.columns= ['gene_cluster_id', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H',] DE_for_anvio_df.A = 0 DE_for_anvio_df.B = 0 DE_for_anvio_df.C = 0 DE_for_anvio_df.D = 0 DE_for_anvio_df.E = 0 DE_for_anvio_df.F = 0 DE_for_anvio_df.G = 0 DE_for_anvio_df.H = 0 DE_for_anvio_df.columns= ['gene_cluster_id', 'DE_d!A', 'DE_d!B', 'DE_o!A', 'DE_o!B', 'DE_od!A', 'DE_od!B', 'DE_odn!A', 'DE_odn!B',] DE_for_anvio_df = DE_for_anvio_df.drop_duplicates()

This is a good time to check with

anvi-display-pan -p SPLIT_PANs/T_signals/PAN.db -g COLWELLIA_GENOMES.db

If everything looks right, it is adviced to go ahead and select the core by:

Barplot of up down DE genes

The following barplot will be added to the anvio plot. Anvio can add a layer to samples but not to the DE layer. So we do it here and later we add in inkscape.

After spending some time in Inkscape this is the current stage of the figure:

DE genes functional (Supplementary) annotation table

Selecting and splitting only GC containing DE genes.

Anvio diagram is getting too difficult to label names. So we are going to select and split bins that only contain DE genes. Selection was done using the search box and now we proceed to split again.

anvi-display-pan -p SPLIT_DE_Genes/DE_genes/PAN.db -g COLWELLIA_GENOMES.db

Static HTML Summary

Now our metapangenome is ready to generate an static HTML output.